dissimilarity measure
- Europe > France (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
Grounding Representation Similarity Through Statistical Testing
To understand neural network behavior, recent works quantitatively compare different networks' learned representations using canonical correlation analysis (CCA), centered kernel alignment (CKA), and other dissimilarity measures. Unfortunately, these widely used measures often disagree on fundamental observations, such as whether deep networks differing only in random initialization learn similar representations. These disagreements raise the question: which, if any, of these dissimilarity measures should we believe? We provide a framework to ground this question through a concrete test: measures should have \emph{sensitivity} to changes that affect functional behavior, and \emph{specificity} against changes that do not. We quantify this through a variety of functional behaviors including probing accuracy and robustness to distribution shift, and examine changes such as varying random initialization and deleting principal components. We find that current metrics exhibit different weaknesses, note that a classical baseline performs surprisingly well, and highlight settings where all metrics appear to fail, thus providing a challenge set for further improvement.
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.15)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- (21 more...)
- Europe > France (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- Europe > France (0.04)
- North America > United States > California > Alameda County > Berkeley (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.15)
- North America > Canada > Ontario > Toronto (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- (21 more...)
Soft-ECM: An extension of Evidential C-Means for complex data
Soubeiga, Armel, Guyet, Thomas, Antoine, Violaine
Clustering based on belief functions has been gaining increasing attention in the machine learning community due to its ability to effectively represent uncertainty and/or imprecision. However, none of the existing algorithms can be applied to complex data, such as mixed data (numerical and categorical) or non-tabular data like time series. Indeed, these types of data are, in general, not represented in a Euclidean space and the aforementioned algorithms make use of the properties of such spaces, in particular for the construction of barycenters. In this paper, we reformulate the Evidential C-Means (ECM) problem for clustering complex data. We propose a new algorithm, Soft-ECM, which consistently positions the centroids of imprecise clusters requiring only a semi-metric. Our experiments show that Soft-ECM present results comparable to conventional fuzzy clustering approaches on numerical data, and we demonstrate its ability to handle mixed data and its benefits when combining fuzzy clustering with semi-metrics such as DTW for time series data.
Ranked differences Pearson correlation dissimilarity with an application to electricity users time series clustering
Charoensuk, Chutiphan, Wiroonsri, Nathakhun
Time series clustering is an unsupervised learning method for classifying time series data into groups with similar behavior. It is used in applications such as healthcare, finance, economics, energy, and climate science. Several time series clustering methods have been introduced and used for over four decades. Most of them focus on measuring either Euclidean distances or association dissimilarities between time series. In this work, we propose a new dissimilarity measure called ranked Pearson correlation dissimilarity (RDPC), which combines a weighted average of a specified fraction of the largest element-wise differences with the well-known Pearson correlation dissimilarity. It is incorporated into hierarchical clustering. The performance is evaluated and compared with existing clustering algorithms. The results show that the RDPC algorithm outperforms others in complicated cases involving different seasonal patterns, trends, and peaks. Finally, we demonstrate our method by clustering a random sample of customers from a Thai electricity consumption time series dataset into seven groups with unique characteristics.
- Asia > Thailand (0.05)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > New York (0.04)
- (3 more...)
Estimating the Optimal Number of Clusters in Categorical Data Clustering by Silhouette Coefficient
Dinh, Duy-Tai, Fujinami, Tsutomu, Huynh, Van-Nam
The problem of estimating the number of clusters (say k) is one of the major challenges for the partitional clustering. This paper proposes an algorithm named k-SCC to estimate the optimal k in categorical data clustering. For the clustering step, the algorithm uses the kernel density estimation approach to define cluster centers. In addition, it uses an information-theoretic based dissimilarity to measure the distance between centers and objects in each cluster. The silhouette analysis based approach is then used to evaluate the quality of different clusterings obtained in the former step to choose the best k. Comparative experiments were conducted on both synthetic and real datasets to compare the performance of k-SCC with three other algorithms. Experimental results show that k-SCC outperforms the compared algorithms in determining the number of clusters for each dataset.
- Asia > Singapore (0.04)
- Asia > Japan (0.04)
- North America > United States > Florida > Palm Beach County > Boca Raton (0.04)
- North America > United States > California > Alameda County > Oakland (0.04)